Author Search Result

[Author] Satoshi NAKAMURA(56hit)

  • A Physical Design Method for a New Memory-Based Reconfigurable Architecture without Switch Blocks

    Masatoshi NAKAMURA  Masato INAGI  Kazuya TANIGAWA  Tetsuo HIRONAKA  Masayuki SATO  Takashi ISHIGURO  

     
    PAPER-Design Methodology

      Vol:
    E95-D No:2
      Page(s):
    324-334

    In this paper, we propose a placement and routing method for a new memory-based programmable logic device (MPLD) and confirm its capability by placing and routing benchmark circuits. An MPLD consists of multiple-output look-up tables (MLUTs) that can be used as logic and/or routing elements, whereas field programmable gate arrays (FPGAs) consist of LUTs (logic elements) and switch blocks (routing elements). MPLDs can implement logic circuits more efficiently than FPGAs because of their flexibility and area efficiency. However, directly applying the existing placement and routing algorithms of FPGAs to MPLDs overcrowds the placed logic cells and causes a shortage of routing domains between them. Our simulated annealing-based method incorporates detailed wire congestion and the proximity of logic cells into its cost function and reserves area for routing. In the experiments, our method reduced wire congestion and successfully placed and routed 27 out of 31 circuits, 13 of which could not be placed or routed using the Versatile Place and Route tool (VPR), a well-known tool for FPGAs.
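
    The congestion-aware annealing loop described above can be sketched as follows. This is a minimal toy, not the authors' implementation: the grid model, the bounding-box congestion estimate, the routing capacity of 2 nets per tile, and the weights alpha/beta are all illustrative assumptions.

      import math, random

      def hpwl(net, pos):
          # Half-perimeter wirelength of one net (a list of cell names).
          xs = [pos[c][0] for c in net]; ys = [pos[c][1] for c in net]
          return (max(xs) - min(xs)) + (max(ys) - min(ys))

      def congestion(nets, pos, w, h):
          # Crude congestion estimate: stack net bounding boxes per tile and
          # penalize tiles above an assumed routing capacity of 2.
          usage = [[0] * w for _ in range(h)]
          for net in nets:
              xs = [pos[c][0] for c in net]; ys = [pos[c][1] for c in net]
              for y in range(min(ys), max(ys) + 1):
                  for x in range(min(xs), max(xs) + 1):
                      usage[y][x] += 1
          return sum(max(0, u - 2) for row in usage for u in row)

      def cost(nets, pos, w, h, alpha=1.0, beta=4.0):
          # Weighted sum of wirelength and congestion reserves room to route.
          return alpha * sum(hpwl(n, pos) for n in nets) + beta * congestion(nets, pos, w, h)

      def anneal(cells, nets, w, h, t=5.0, cooling=0.999, steps=5000):
          pos = {c: (random.randrange(w), random.randrange(h)) for c in cells}
          cur = cost(nets, pos, w, h)
          for _ in range(steps):
              c = random.choice(cells)
              old = pos[c]
              pos[c] = (random.randrange(w), random.randrange(h))
              new = cost(nets, pos, w, h)
              if new <= cur or random.random() < math.exp((cur - new) / t):
                  cur = new                      # accept the move
              else:
                  pos[c] = old                   # reject and restore
              t *= cooling
          return pos

      cells = ["c0", "c1", "c2", "c3"]
      print(anneal(cells, [["c0", "c1"], ["c1", "c2", "c3"]], w=4, h=4))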

  • Construction of Audio-Visual Speech Corpus Using Motion-Capture System and Corpus Based Facial Animation

    Tatsuo YOTSUKURA  Shigeo MORISHIMA  Satoshi NAKAMURA  

     
    PAPER

      Vol:
    E88-D No:11
      Page(s):
    2477-2483

    An accurate audio-visual speech corpus is indispensable for talking-heads research. This paper presents our audio-visual speech corpus collection and proposes a head-movement normalization method and a facial motion generation method. The audio-visual corpus contains speech data, video data of faces, and the positions and movements of facial organs. The corpus consists of Japanese phoneme-balanced sentences uttered by a female native speaker. Accurate facial capture is realized by using an optical motion-capture system. We captured high-resolution 3D data by arranging many markers on the speaker's face. In addition, we propose a method of acquiring the facial movements and removing the head movements by using an affine transformation, which yields the displacements of the facial organs alone. Finally, in order to easily create facial animation from this motion data, we propose a technique for assigning the captured data to a facial polygon model. Evaluation results demonstrate the effectiveness of the proposed facial motion generation method and show the relationship between the number of markers and the errors.
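
    The affine head-movement normalization can be sketched as below: estimate the affine transform that maps reference head-marker positions to the current frame, then apply its inverse to the facial markers. The marker layout and the least-squares fit are assumptions of this sketch, not the paper's exact procedure.

      import numpy as np

      def remove_head_motion(head_ref, head_cur, face_cur):
          # head_ref, head_cur: (N, 3) rigid head-marker positions in the
          # reference frame / current frame; face_cur: (M, 3) face markers.
          X = np.hstack([head_ref, np.ones((len(head_ref), 1))])    # (N, 4)
          # Least-squares affine fit: head_cur ~= head_ref @ A.T + t.
          W, *_ = np.linalg.lstsq(X, head_cur, rcond=None)          # (4, 3)
          A, t = W[:3].T, W[3]
          # Undo the head motion; only pure facial displacement remains.
          return (face_cur - t) @ np.linalg.inv(A).T

      head_ref = np.array([[0., 0, 0], [1, 0, 0], [0, 1, 0], [0, 0, 1]])
      head_cur = head_ref + np.array([0.5, 0.2, 0.0])               # pure translation
      face_cur = np.array([[1.5, 1.2, 0.0]])
      print(remove_head_motion(head_ref, head_cur, face_cur))       # ~[[1. 1. 0.]]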

  • A Bayesian Model of Transliteration and Its Human Evaluation When Integrated into a Machine Translation System

    Andrew FINCH  Keiji YASUDA  Hideo OKUMA  Eiichiro SUMITA  Satoshi NAKAMURA  

     
    PAPER

      Vol:
    E94-D No:10
      Page(s):
    1889-1900

    The contribution of this paper is two-fold. Firstly, we conduct a large-scale real-world evaluation of the effectiveness of integrating an automatic transliteration system with a machine translation system. A human evaluation is usually preferable to an automatic evaluation, and especially so in this case, since common machine translation evaluation methods are affected by the length of the translations they evaluate, often being biased towards length rather than the information the translations convey. We evaluate our transliteration system on data collected in field experiments conducted all over Japan. Our results conclusively show that using a transliteration system can improve machine translation quality when translating unknown words. Our second contribution is a novel Bayesian model for unsupervised bilingual character sequence segmentation of corpora for transliteration. The system is based on a Dirichlet process model trained using Bayesian inference through blocked Gibbs sampling, implemented with an efficient forward filtering/backward sampling dynamic programming algorithm. The Bayesian approach is able to overcome the overfitting problem inherent in maximum likelihood training. We demonstrate the effectiveness of our Bayesian segmentation by using it to build a translation model for a phrase-based statistical machine translation (SMT) system trained to perform transliteration by monotonic transduction from character sequence to character sequence. The Bayesian segmentation was used to construct a phrase-table, and we compared its quality to one generated in the usual manner by the state-of-the-art GIZA++ word alignment process combined with the phrase extraction heuristics of the MOSES statistical machine translation system, using both to perform transliteration generation within an identical framework. In our experiments on English-Japanese data from the NEWS2010 transliteration generation shared task, we used our technique to bilingually co-segment the training corpus. We then derived a phrase-table from the segmentation in the sample at the final iteration of the training procedure, and used it as a direct substitute for the phrase-table extracted with GIZA++/MOSES. The phrase-table resulting from our Bayesian segmentation model was approximately 30% smaller than that produced by the SMT system's training procedure, and gave an increase in transliteration quality measured in terms of both word accuracy and F-score.
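
    The forward filtering/backward sampling step can be illustrated with a toy unigram segmenter, shown below. The Dirichlet-process unit model is replaced by a caller-supplied unit_prob function, and bilingual co-segmentation is reduced to the monolingual case for brevity, so this is a sketch of the dynamic program only, not the paper's model.

      import random

      def ffbs_segment(s, unit_prob, max_len=3):
          # Sample one segmentation of string s by forward filtering /
          # backward sampling under a unigram model over substrings.
          n = len(s)
          alpha = [0.0] * (n + 1)          # forward probabilities
          alpha[0] = 1.0
          for i in range(1, n + 1):
              alpha[i] = sum(alpha[j] * unit_prob(s[j:i])
                             for j in range(max(0, i - max_len), i))
          # Backward sampling: walk from the end, picking each boundary
          # proportionally to its forward mass.
          segs, i = [], n
          while i > 0:
              js = list(range(max(0, i - max_len), i))
              ws = [alpha[j] * unit_prob(s[j:i]) for j in js]
              j = random.choices(js, weights=ws)[0]
              segs.append(s[j:i])
              i = j
          return segs[::-1]

      # Example: toy base measure that decays with unit length.
      print(ffbs_segment("katakana", lambda u: 0.1 ** len(u)))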

  • Using Mutual Information Criterion to Design an Efficient Phoneme Set for Chinese Speech Recognition

    Jin-Song ZHANG  Xin-Hui HU  Satoshi NAKAMURA  

     
    PAPER-Acoustic Modeling

      Vol:
    E91-D No:3
      Page(s):
    508-513

    Chinese is a representative tonal language, and how to process tone information in state-of-the-art large-vocabulary speech recognition systems has been an attractive topic. This paper presents a novel way to derive an efficient phoneme set of tone-dependent units for building a recognition system, by iteratively merging pairs of tone-dependent units according to the principle of minimal loss of Mutual Information (MI). The MI is measured between the word tokens and their phoneme transcriptions in a training text corpus, based on the system lexicon and language model. The approach keeps the discriminative tonal (and phonemic) contrasts that are most helpful for disambiguating words that would be homophones without tones, and merges those contrasts that are not important for word disambiguation in the recognition task. This enables flexible selection of a phoneme set that balances the amount of MI against the number of phonemes. We applied the method to the traditional phoneme set of Initials/Finals and derived several phoneme sets with different numbers of units. Speech recognition experiments using the derived sets showed their effectiveness.
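
    A minimal sketch of the greedy merging loop follows. The MI computation over word tokens and their transcriptions is abstracted into a caller-supplied mi_of function, since it depends on the lexicon and language model; only the merge search itself is shown.

      from itertools import combinations

      def greedy_merge(units, mi_of, target_size):
          # mi_of(mapping) recomputes the word/transcription mutual
          # information under a unit-merging map; mapping sends each
          # original unit to its current merged class.
          mapping = {u: u for u in units}
          while len(set(mapping.values())) > target_size:
              best, best_mi = None, float("-inf")
              for a, b in combinations(sorted(set(mapping.values())), 2):
                  trial = {u: (a if v == b else v) for u, v in mapping.items()}
                  m = mi_of(trial)
                  if m > best_mi:        # least MI loss = highest remaining MI
                      best, best_mi = trial, m
              mapping = best
          return mapping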

  • An Unsupervised Model of Redundancy for Answer Validation

    Youzheng WU  Hideki KASHIOKA  Satoshi NAKAMURA  

     
    PAPER-Natural Language Processing

      Vol:
    E93-D No:3
      Page(s):
    624-634

    Given a question and a set of its candidate answers, the task of answer validation (AV) aims to return a Boolean value indicating whether a given candidate answer is the correct answer to the question. Unlike previous work, this paper presents an unsupervised model, called the U-model, for AV. This approach regards AV as a classification task and investigates how to effectively incorporate the redundancy of the Web into the proposed architecture. Experimental results with TREC factoid test sets and Chinese test sets indicate that the proposed U-model with redundancy information is very effective for AV. For example, the top@1/mrr@5 scores on the TREC05 and TREC06 tracks are 40.1/51.5% and 35.8/47.3%, respectively. Furthermore, a cross-model comparison experiment demonstrates that the U-model is the best among the redundancy-based models considered. Even compared with a syntax-based approach, a supervised machine-learning approach, and a pattern-based approach, the U-model performs much better.
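
    As a rough illustration of the Web-redundancy signal, the toy scorer below counts how often a candidate co-occurs with question terms across retrieved snippets; thresholding the score gives a Boolean decision. The actual U-model is an unsupervised classifier, so this heuristic is only an assumption-laden stand-in.

      def redundancy_score(candidate, question_terms, snippets):
          # Fraction of snippets where the candidate answer appears
          # together with at least one question term.
          hits = 0
          for s in snippets:
              low = s.lower()
              if candidate.lower() in low and any(t.lower() in low for t in question_terms):
                  hits += 1
          return hits / max(1, len(snippets))

      snippets = ["Mount Everest is the highest mountain on Earth.",
                  "The highest mountain, Everest, lies in the Himalayas."]
      print(redundancy_score("Everest", ["highest", "mountain"], snippets))  # 1.0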

  • Speaker Weighted Training of HMM Using Multiple Reference Speakers

    Hiroaki HATTORI  Satoshi NAKAMURA  Kiyohiro SHIKANO  Shigeki SAGAYAMA  

     
    PAPER-Speech Processing

      Vol:
    E76-D No:2
      Page(s):
    219-226

    This paper proposes a new speaker adaptation method using a speaker weighting technique for multiple reference speaker training of a hidden Markov model (HMM). The proposed method considers the similarities between an input speaker and multiple reference speakers, and uses the similarities to control the influence of each reference speaker on the HMM. Evaluation experiments were carried out on a /b, d, g, m, n, N/ phoneme recognition task using 8 speakers. Average recognition rates for three test sets with different speech styles were 68.0%, 66.4%, and 65.6%, respectively. These were 4.8%, 8.8%, and 10.5% higher than the rates of the spectrum mapping method, and 1.6%, 6.7%, and 8.2% higher than those of multiple reference speaker training (the supplemented HMM). The evaluation experiments confirmed the effectiveness of the proposed method.
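
    The core idea can be sketched as a similarity-weighted combination of per-speaker model statistics. The softmax weighting and the mean-only combination below are assumptions of this sketch, not the paper's exact training formulas.

      import numpy as np

      def speaker_weighted_means(ref_means, similarities):
          # ref_means: (S, D) state mean per reference speaker;
          # similarities: (S,) similarity of the input speaker to each
          # reference. Similar speakers get larger weights.
          ref_means = np.asarray(ref_means, float)
          sims = np.asarray(similarities, float)
          w = np.exp(sims - sims.max())
          w /= w.sum()                   # normalized speaker weights
          return w @ ref_means           # weighted mean, shape (D,)

      print(speaker_weighted_means([[0.0, 1.0], [2.0, 3.0]], [0.9, 0.1]))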

  • Burst Error Recovery for Huffman Coding

    Masato KITAKAMI  Satoshi NAKAMURA  

     
    LETTER-Algorithm Theory

      Vol:
    E88-D No:9
      Page(s):
    2197-2200

    Although data compression is widely used, compressed data are very sensitive to errors. This paper proposes a single burst error recovery method for Huffman coding that uses bidirectionally decodable Huffman coding. Computer simulation shows that the proposed method can recover a burst error of up to 2.5 l_burst bits with high probability, where l_burst is the maximum length of burst errors that the proposed method is expected to be able to recover.
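
    Bidirectional decodability rests on a fix-free code: both the codeword set and its reversal are prefix-free, so a stream decodes from either end, and after a burst the receiver can recover the symbols on each side of the corrupted span. The tiny codebook below is an illustrative assumption, not the paper's construction.

      CODE = {"A": "0", "B": "11", "C": "101"}      # a tiny fix-free codebook:
      REV = {s: c[::-1] for s, c in CODE.items()}   # its reversal is prefix-free too

      def greedy_decode(bits, book):
          # Decode with a prefix-free table; return the symbols read and the
          # bit position where decoding stopped (len(bits) if all consumed).
          out, i = [], 0
          while i < len(bits):
              for sym, cw in book.items():
                  if bits.startswith(cw, i):
                      out.append(sym)
                      i += len(cw)
                      break
              else:
                  break        # no codeword matches: likely inside the burst
          return out, i

      # The same stream decodes from the front and, via REV, from the back.
      bits = "0" + "101" + "11"                     # encodes A C B
      print(greedy_decode(bits, CODE))              # (['A', 'C', 'B'], 6)
      print(greedy_decode(bits[::-1], REV))         # (['B', 'C', 'A'], 6)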

  • Face-to-Talk: Audio-Visual Speech Detection for Robust Speech Recognition in Noisy Environment

    Kazumasa MURAI  Satoshi NAKAMURA  

     
    PAPER-Robust Speech Recognition and Enhancement

      Vol:
    E86-D No:3
      Page(s):
    505-513

    This paper discusses "face-to-talk" audio-visual speech detection for robust speech recognition in noisy environments, which consists of a facial-orientation-based switch and audio-visual speech section detection. Most of today's speech recognition systems must actually be turned on and off by a switch, e.g., "push-to-talk", to indicate which utterance should be recognized, and a specific speech section must be detected prior to any further analysis. To improve usability and performance, we have researched how to extract useful information from the visual modality. We implemented a facial-orientation-based switch, which activates speech recognition while the speaker is facing the camera. Then, the speech section is detected by analyzing the image of the face. Visual speech detection is robust to audio noise, but because articulation starts before the speech and lasts longer than it, the detected section tends to be too long and produces insertion errors. Therefore, we fuse the sections detected by the audio and visual modalities. Our experiment confirms that the proposed audio-visual speech detection method improves recognition performance in noisy environments.
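
    A minimal sketch of the fusion step: keep audio-detected speech only where it overlaps a visually detected section, trimming the visual over-detection that causes insertion errors. Interval-intersection fusion is an assumption here; the paper's fusion rule may differ.

      def fuse_sections(audio_secs, visual_secs):
          # Sections are (start, end) times in seconds.
          fused = []
          for a0, a1 in audio_secs:
              for v0, v1 in visual_secs:
                  s, e = max(a0, v0), min(a1, v1)
                  if s < e:
                      fused.append((s, e))        # keep the overlap only
          return fused

      # The visual section is longer (lip motion starts early); fusion trims it.
      print(fuse_sections([(1.2, 2.8)], [(0.9, 3.4)]))   # [(1.2, 2.8)]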

  • A Hybrid HMM/BN Acoustic Model for Automatic Speech Recognition

    Konstantin MARKOV  Satoshi NAKAMURA  

     
    PAPER-Speech and Speaker Recognition

      Vol:
    E86-D No:3
      Page(s):
    438-445

    In current HMM-based speech recognition systems, it is difficult to supplement acoustic spectrum features with additional information such as pitch, gender, articulator positions, etc. On the other hand, Bayesian Networks (BN) allow different continuous as well as discrete features to be combined easily by exploiting the conditional dependencies between them. However, the lack of efficient algorithms has limited their application in continuous speech recognition. In this paper we propose a new acoustic model, in which an HMM models the temporal characteristics of speech and the state probability model is represented by a BN. In our experimental system based on the HMM/BN model, in addition to the speech observation variable, the state BN has two more (hidden) variables representing the noise type and the SNR value. Evaluation results on the AURORA2 database showed a 36.4% word error rate reduction for the closed noise test, which is comparable with other, much more complex systems utilizing effective adaptation and noise-robust methods.
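
    The state output probability then marginalizes the hidden noise and SNR variables out of a conditional Gaussian. The sketch below assumes diagonal covariances and independent priors over noise type and SNR bin; the paper's BN topology may differ.

      import numpy as np

      def state_likelihood(x, means, variances, p_noise, p_snr):
          # p(x|q) = sum_n sum_s p(n) p(s) N(x; mu[n,s], var[n,s]),
          # with means/variances of shape (N, S, D) for N noise types,
          # S SNR bins, and D feature dimensions.
          x = np.asarray(x, float)
          d = x.shape[-1]
          diff2 = (x - means) ** 2 / variances                    # (N, S, D)
          log_g = -0.5 * (diff2.sum(-1) + np.log(variances).sum(-1)
                          + d * np.log(2 * np.pi))                # (N, S)
          w = np.outer(p_noise, p_snr)                            # (N, S)
          return float((w * np.exp(log_g)).sum())

      means = np.zeros((2, 3, 4)); variances = np.ones((2, 3, 4))
      print(state_likelihood(np.zeros(4), means, variances,
                             [0.5, 0.5], [0.2, 0.3, 0.5]))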

  • Neural Oscillation-Based Classification of Japanese Spoken Sentences During Speech Perception

    Hiroki WATANABE  Hiroki TANAKA  Sakriani SAKTI  Satoshi NAKAMURA  

     
    PAPER-Biocybernetics, Neurocomputing

      Publicized:
    2018/11/14
      Vol:
    E102-D No:2
      Page(s):
    383-391

    Brain-computer interfaces (BCIs) allow users to convey their intentions directly with brain signals. For example, a spelling system that uses EEG allows letters on a display to be selected. In comparison, previous studies have investigated decoding speech information, such as syllables or words, from single-trial brain signals during speech comprehension or articulatory imagery. Such decoding realizes speech recognition with a relatively short time lag and without relying on a display. Previous magnetoencephalography (MEG) research showed that a template matching method could classify three English sentences by using phase patterns in theta oscillations. This method is based on the synchronization between speech rhythms and neural oscillations during speech processing, that is, theta oscillations synchronize with syllabic rhythms and low-gamma oscillations with phonemic rhythms. The present study aimed to bring this classification method closer to a BCI application. To this end, (1) we investigated the performance of EEG-based classification of three Japanese sentences and (2) evaluated the generalizability of our models to other users. To improve accuracy, (3) we investigated the performance of four classifiers: template matching (baseline), logistic regression, support vector machine, and random forest. In addition, (4) we propose novel features including phase patterns in a higher frequency range. Our proposed features were constructed to capture synchronization in the low-gamma band, that is, (i) phases of EEG oscillations in the range of 2-50 Hz from all electrodes used for measuring the EEG data (all) and (ii) phases selected on the basis of feature importance (selected). The classification results showed that, except for random forest, most classifiers performed similarly. Our proposed features improved the classification accuracy with statistical significance compared with the baseline feature, a phase pattern in neural oscillations in the range of 4-8 Hz from the right hemisphere. The best mean accuracy across folds was 55.9%, using template matching trained with all features. We conclude that the use of phase information in a higher frequency band improves the performance of EEG-based sentence classification and that this model is applicable to other users.
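
    The phase-pattern features and template matching can be sketched as follows. The FFT-based band-pass, the Hilbert-transform phase extraction, and circular-distance matching are standard techniques assumed here, not the paper's exact pipeline.

      import numpy as np

      def phase_features(eeg, fs, bands=((4, 8), (30, 50))):
          # eeg: (channels, samples). For each band, band-pass with an FFT
          # mask, then take the instantaneous phase of the analytic signal.
          n = eeg.shape[-1]
          freqs = np.fft.rfftfreq(n, 1 / fs)
          out = []
          for lo, hi in bands:
              spec = np.fft.rfft(eeg) * ((freqs >= lo) & (freqs <= hi))
              band = np.fft.irfft(spec, n)
              # Analytic signal: zero negative frequencies, double positives.
              full = np.fft.fft(band)
              h = np.zeros(n); h[0] = 1; h[1:(n + 1) // 2] = 2
              if n % 2 == 0: h[n // 2] = 1
              out.append(np.angle(np.fft.ifft(full * h)))
          return np.stack(out, axis=1)          # (channels, bands, samples)

      def classify(trial_phase, templates):
          # Pick the sentence whose mean phase template is closest to the
          # trial's phase pattern under a circular distance.
          d = [np.mean(1 - np.cos(trial_phase - t)) for t in templates]
          return int(np.argmin(d))

      eeg = np.random.default_rng(0).normal(size=(8, 512))
      print(phase_features(eeg, fs=256).shape)    # (8, 2, 512)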

  • FOREWORD

    Satoshi NAKAMURA  

     
    FOREWORD

      Vol:
    E85-D No:3
      Page(s):
    443-443

  • Construction of Spontaneous Emotion Corpus from Indonesian TV Talk Shows and Its Application on Multimodal Emotion Recognition

    Nurul LUBIS  Dessi LESTARI  Sakriani SAKTI  Ayu PURWARIANTI  Satoshi NAKAMURA  

     
    PAPER-Speech and Hearing

      Publicized:
    2018/05/10
      Vol:
    E101-D No:8
      Page(s):
    2092-2100

    As interaction between human and computer develops toward the most natural form possible, it becomes increasingly urgent to incorporate emotion in the equation. This paper describes a step toward extending research on emotion recognition to Indonesian. The field continues to develop, yet exploration of the subject in Indonesian is still lacking. In particular, this paper highlights two contributions: (1) the construction of the first emotional audio-visual database in Indonesian, and (2) the first multimodal emotion recognizer in Indonesian, built from the aforementioned corpus. In constructing the corpus, we aim at natural emotions corresponding to real-life occurrences. However, the collection of emotional corpora is notably labor-intensive and expensive. To reduce the cost, we collect the emotional data from recordings of television programs, eliminating the need for an elaborate recording setup and experienced participants. In particular, we choose television talk shows for their natural conversational content, which yields spontaneous emotion occurrences. To cover a broad range of emotions, we collected three episodes in different genres: politics, humanity, and entertainment. In this paper, we report points of analysis of the data and annotations. The acquired emotion corpus serves as a foundation for further research on emotion. Subsequently, in the experiment, we employ the support vector machine (SVM) algorithm to model the emotions in the collected data. We perform multimodal emotion recognition utilizing the predictions of three modalities: acoustic, semantic, and visual. Compared to the unimodal result, the multimodal feature combination attains identical accuracy for arousal at 92.6% and a significant improvement for the valence classification task at 93.8%. We hope to continue this work and move towards a finer-grained, more precise quantification of emotion.
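
    Late fusion over the three modalities can be sketched with scikit-learn as below. The synthetic per-modality scores, the stack-then-SVM design, and the train/test split are all assumptions of this illustration, not the paper's features or protocol.

      import numpy as np
      from sklearn.svm import SVC

      # Stack each modality's class scores into one feature vector and
      # train an SVM on the fused representation.
      rng = np.random.default_rng(0)
      n = 200
      acoustic = rng.normal(size=(n, 2))   # per-modality valence scores
      semantic = rng.normal(size=(n, 2))
      visual = rng.normal(size=(n, 2))
      y = (acoustic[:, 0] + semantic[:, 0] + visual[:, 0] > 0).astype(int)

      fused = np.hstack([acoustic, semantic, visual])
      clf = SVC(kernel="rbf").fit(fused[:150], y[:150])
      print("valence accuracy:", clf.score(fused[150:], y[150:]))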

  • Class-Dependent Modeling for Dialog Translation

    Andrew FINCH  Eiichiro SUMITA  Satoshi NAKAMURA  

     
    PAPER-Speech and Hearing

      Vol:
    E92-D No:12
      Page(s):
    2469-2477

    This paper presents a technique for class-dependent decoding for statistical machine translation (SMT). The approach differs from previous methods of class-dependent translation in that the class-dependent forms of all models are integrated directly into the decoding process. We employ probabilistic mixture weights between models that can change dynamically on a sentence-by-sentence basis depending on the characteristics of the source sentence. The effectiveness of this approach is demonstrated by evaluating its performance on travel conversation data. We used this approach to tackle the translation of questions and declarative sentences using class-dependent models. To achieve this, our system integrated two sets of models specifically built to deal with sentences that fall into one of two classes of dialog sentence: questions and declarations, with a third set of models built with all of the data to handle the general case. The technique was thoroughly evaluated on data from 16 language pairs using 6 machine translation evaluation metrics. We found the results were corpus-dependent, but in most cases our system was able to improve translation performance, and for some languages the improvements were substantial.
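
    A minimal sketch of the sentence-level mixture: weight each class-specific model's probability by the probability that the source sentence belongs to that class (question, declarative, or general). The toy models and the soft class posterior below are illustrative assumptions, not the paper's trained components.

      import math

      def mixture_logprob(features, class_models, class_probs):
          # class_models: functions f(features) -> P(target | source, class);
          # class_probs: P(class | source sentence), summing to 1.
          p = sum(w * m(features) for m, w in zip(class_models, class_probs))
          return math.log(p)

      # Toy models: a "question" model and a "declarative" model.
      q_model = lambda f: 0.8 if f["is_question"] else 0.2
      d_model = lambda f: 0.3 if f["is_question"] else 0.7
      feats = {"is_question": True}
      print(mixture_logprob(feats, [q_model, d_model], [0.9, 0.1]))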

  • Online EEG-Based Emotion Prediction and Music Generation for Inducing Affective States

    Kana MIYAMOTO  Hiroki TANAKA  Satoshi NAKAMURA  

     
    PAPER-Human-computer Interaction

      Publicized:
    2022/02/15
      Vol:
    E105-D No:5
      Page(s):
    1050-1063

    Music is often used for emotion induction because it can change people's emotions. However, since the emotions subjectively felt while listening to the same music differ from person to person, we propose an emotion induction system that generates music adapted to each individual. Our system automatically generates music suitable for emotion induction based on the emotions predicted from an electroencephalogram (EEG). We examined three elements for constructing our system: 1) a music generator that creates music inducing emotions that resemble its input, 2) real-time emotion prediction using EEG, and 3) control of the music generator using the predicted emotions so that the generated music is suitable for inducing the target emotions. We constructed our proposed system from these elements and evaluated it. The results showed its effectiveness for inducing emotions and suggest that feedback loops that tailor stimuli to individuals can successfully induce emotions.
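
    The feedback loop can be sketched as a simple proportional controller over a valence-arousal setpoint. The three callables stand in for the system's components, and the control law is an assumption of this sketch, not the paper's method.

      def induction_loop(target, predict_emotion, generate_music, play,
                         steps=10, gain=0.5):
          # Closed loop: generate music aimed at the setpoint, observe the
          # EEG-predicted (valence, arousal), and correct the setpoint.
          setpoint = list(target)
          for _ in range(steps):
              clip = generate_music(setpoint)
              play(clip)
              v, a = predict_emotion()
              setpoint[0] += gain * (target[0] - v)
              setpoint[1] += gain * (target[1] - a)
          return setpoint

      # Stub components: the listener responds with an attenuated emotion.
      state = {"emo": (0.0, 0.0)}
      music = lambda sp: tuple(sp)
      play = lambda clip: state.update(emo=(0.7 * clip[0], 0.7 * clip[1]))
      read = lambda: state["emo"]
      print(induction_loop((0.8, 0.3), read, music, play))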

  • An FET Coupled Logic (FCL) Circuit for Multi-Gb/s, Low Power and Low Voltage Serial Interface BiCMOS LSIs

    Hitoshi OKAMURA  Masaharu SATO  Satoshi NAKAMURA  Shuji KISHI  Kunio KOKUBU  

     
    PAPER-Silicon Devices

      Vol:
    E82-C No:3
      Page(s):
    531-537

    This paper describes a newly developed FET Coupled Logic (FCL) circuit that operates at very high frequencies with very low supply voltages, below 3.3 V. An FCL circuit consists of NMOS source-coupled transistor pairs for current switches, load resistors, emitter followers, and current sources that are controlled by a band-gap reference bias generator. The characteristics and performance are discussed by comparing this circuit with other high-speed circuits. The optimal circuit parameters for FCL circuits are also discussed, and we note that a larger swing voltage enhances the circuit's performance. The simulated delay of a 0.25 µm FCL circuit is less than 15 ps with a 2.5 V power supply, and the simulated maximum toggle frequencies are over 5 GHz and 10 GHz at 2.5 V and 3.3 V power supplies, respectively. The simulation results show that FCL circuits achieve the best performance among current-mode circuits, including ECL circuits and NMOS source-coupled logic circuits. The delay of the FCL circuit is less than half that of an ECL circuit, and its maximum toggle frequency is about triple that of an NMOS source-coupled logic circuit. Because the FCL circuit uses low-cost CMOS-based BiCMOS technologies, its cost performance is superior to that of ECL circuits, which require expensive base-emitter self-aligned processes and trench isolation processes. Using depletion-mode NMOS transistors for the current switches can lower the minimum supply voltage for FCL circuits to below 1.5 V. The FCL circuit is a promising logic gate circuit for multi-Gbit/s tele/data communication LSIs.

  • Semantically Readable Distributed Representation Learning and Its Expandability Using a Word Semantic Vector Dictionary

    Ikuo KESHI  Yu SUZUKI  Koichiro YOSHINO  Satoshi NAKAMURA  

     
    PAPER

      Publicized:
    2018/01/18
      Vol:
    E101-D No:4
      Page(s):
    1066-1078

    The problem with distributed representations generated by neural networks is that the meaning of the features is difficult to understand. We propose a new method that gives a specific meaning to each node of a hidden layer by introducing a manually created word semantic vector dictionary into the initial weights and by using paragraph vector models. We conducted experiments to test the hypotheses using a single-domain benchmark for Japanese Twitter sentiment analysis and then evaluated the expandability of the method using a diverse and large-scale benchmark. Moreover, we tested the domain independence of the method using a Wikipedia corpus. Our experimental results demonstrate that the learned vectors outperform the existing paragraph vector on the Twitter sentiment analysis task with the single-domain benchmark. Also, we assessed the readability of document embeddings, i.e., distributed representations of documents, in a user test; readability here means that people can understand the meaning of heavily weighted features of the distributed representations. A total of 52.4% of the top-five weighted hidden nodes were related to the tweets when one of the paragraph vector models learned the document embeddings. For the expandability evaluation, we improved the dictionary based on the results of the hypothesis test and examined the relationship between the readability of the learned word vectors and the task accuracy of Twitter sentiment analysis using the diverse and large-scale benchmark. We also conducted a word similarity task on the Wikipedia corpus to test the domain independence of the method. The expandability results are better than or comparable to the performance of the paragraph vector, and the objective and subjective evaluations support that each hidden node maintains a specific meaning. Thus, the proposed method succeeds in improving readability.
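
    The dictionary-seeded initialization can be sketched as follows: hidden nodes start out as named semantic features because dictionary vectors are copied into the initial weights. The matrix layout and noise scale are assumptions of this sketch; the paper's paragraph-vector integration is more involved.

      import numpy as np

      def init_hidden_weights(vocab, sem_dict, dim, rng=np.random.default_rng(0)):
          # Rows for dictionary words copy their semantic vectors, so each
          # hidden dimension keeps a human-readable meaning; out-of-dictionary
          # words get small random noise.
          W = rng.normal(scale=0.01, size=(len(vocab), dim))
          for i, w in enumerate(vocab):
              if w in sem_dict:
                  W[i] = sem_dict[w]     # seed with the dictionary meaning
          return W

      sem = {"good": np.eye(4)[0], "bad": -np.eye(4)[0]}
      print(init_hidden_weights(["good", "bad", "meh"], sem, 4))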
